Testing goodness-of-fit via rate distortion 



Peter Harremoes 
Centrum voor Wiskunde en Informatica (CWI) 
Amsterdam, NL-1090 GB 
The Netherlands
P.Harremoes@cwi.nl 






Abstract — A framework is developed using techniques from 
rate distortion theory in statistical testing. The idea is first to do 
optimal compression according to a certain distortion function 
and then use information divergence from the compressed em- 
pirical distribution to the compressed null hypothesis as statistic. 
Only very special cases have been studied in more detail, but 
they indicate that the approach can be used under very general 
conditions. 

I. Introduction 

There are many well-known examples of a fruitful interplay 
between information theory and statistics. It started with [1] 
and [2] and is well described in [3]. Information divergence, or
Kullback-Leibler information, plays a central role in measuring
the distance between probability distributions. Statistical
testing is often delicate if the sample size is small compared
with the size of the alphabet (sample space). If the alphabet is a
continuous set the normal approach in statistics is to discretize 
the alphabet, but information is lost during discretization, and 
often it is not clear how one should discretize the space. 

Rate distortion theory was developed as a theoretical frame- 
work for lossy compression. An obvious example is image 
compression, but rate distortion theory often fails for this 
kind of application for three reasons. First of all it is often 
very difficult to specify an appropriate distortion function. 
Secondly, the statistics of the source are often not known.
Thirdly, in most cases it is impossible to calculate the rate 
distortion function exactly and even a numerical calculation 
may be very involved due to the number of variables. 

Although rate distortion theory was developed for lossy 
compression we claim that the ideas are very useful for 
statistical analysis. 

II. Likelihood ratio testing 

On a finite sample space of size $k$ one can use information
divergence as a statistic for testing goodness of fit. This is
the likelihood ratio test. We want to test a null hypothesis
$H_0 : P = P_0$. An iid sample $\omega$ of size $n$ is drawn from $P$,
and the null hypothesis is accepted if $D(\mathrm{Emp}_n(\omega) \| P_0)$ is
smaller than some critical value and rejected if it exceeds this value.
The critical value is determined by the significance level. If
$P_0$ is the uniform distribution then
$H(\mathrm{Emp}_n(\omega)) = \log k - D(\mathrm{Emp}_n(\omega) \| P_0)$,
so in this case it makes no difference
whether one uses entropy or information divergence. Using
large deviation theory one can show that no other test is
more Bahadur efficient than the likelihood ratio test. Under the
null hypothesis the distribution of $2n D(\mathrm{Emp}_n(\omega) \| P_0)$
converges to a $\chi^2$ distribution with $k - 1$ degrees of freedom,
so determining the critical values corresponding to different
significance levels is simple.
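The quantities in this section can be illustrated with a minimal sketch. The alphabet and the sample below are hypothetical, and the helper names are not taken from the paper; the block only shows how the empirical distribution, the divergence and the entropy are related when the null hypothesis is uniform.

```python
import math
from collections import Counter

def empirical_distribution(sample, alphabet):
    """Relative frequencies of each symbol of `alphabet` in `sample`."""
    counts = Counter(sample)
    n = len(sample)
    return [counts[a] / n for a in alphabet]

def divergence(p, q):
    """Information divergence D(p||q) in nats, with 0*log(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

alphabet = ["a", "b", "c", "d"]
sample = list("abacabadabcaabdc")          # hypothetical data, n = 16
k, n = len(alphabet), len(sample)
emp = empirical_distribution(sample, alphabet)
uniform = [1.0 / k] * k

d = divergence(emp, uniform)
g_statistic = 2 * n * d   # compared with a chi-square critical value in practice
```

With a uniform null hypothesis, `entropy(emp)` and `math.log(k) - d` agree up to rounding, which is the identity used in the text.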

This method cannot be used directly if the sample space is
infinite and $P_0$ is continuous. If $P_0$ is a distribution on $\mathbb{R}$ with
continuous distribution function $F$ then a popular method for
testing goodness of fit is to divide $\mathbb{R}$ into $k$ bins of equal
probability. As one wants to keep points together if they are
close on the real axis, the bins should be chosen of the form
$\left[F^{-1}\left(\tfrac{j-1}{k}\right); F^{-1}\left(\tfrac{j}{k}\right)\right[$.
If $f$ maps a point into its bin then
$P_0$ is mapped into a uniform distribution so we can use the
entropy $H(f(\mathrm{Emp}_n(\omega)))$ as statistic to test goodness of fit.
The idea is then to increase the number of bins slowly as $n$
increases. Recently it was proved that entropy is more Bahadur
efficient than other power statistics if $k$ is increased so slowly
that the mean number of samples per bin $n/k$ tends to infinity
for $n \to \infty$, see [4], [5] and references therein. This condition
will hold if for instance $k = n^{1/2}$, and this choice of the number
of bins will also ensure that the distribution of entropy is
asymptotically Gaussian.
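The binning procedure can be sketched as follows. This is only an illustration under stated assumptions: the null hypothesis is taken to be the standard Gaussian, the sample is simulated with a fixed seed, and the function names are hypothetical.

```python
import math
import random

def normal_cdf(x):
    """Distribution function F of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bin_index(x, k, cdf=normal_cdf):
    """Index j in {0,...,k-1} of the equal-probability bin containing x."""
    return min(int(k * cdf(x)), k - 1)

def binned_entropy(sample, k, cdf=normal_cdf):
    """Entropy (nats) of the empirical distribution after binning."""
    counts = [0] * k
    for x in sample:
        counts[bin_index(x, k, cdf)] += 1
    n = len(sample)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

rng = random.Random(0)                      # fixed seed, illustrative only
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
k = int(math.sqrt(len(sample)))             # k = n^(1/2) bins
h = binned_entropy(sample, k)
deficit = math.log(k) - h                   # divergence of the binned sample from uniform
```

Applying `cdf` before binning is what makes the bins equiprobable under the null hypothesis, so the binned null distribution is uniform and the entropy deficit plays the role of the divergence.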

It is easy to divide $\mathbb{R}$ into $k$ bins of equal probability for
a continuous distribution but it is not obvious how to do the
same for distributions on $\mathbb{R}^2$ or in higher dimensions. Even in
one dimension it is far from obvious why the bins should be 
of equal probability. Maybe a different choice of bins would 
sometimes give a test that in one or another sense is more 
efficient. To get better founded criteria for how to choose bins 
we need a distortion function. 

III. The rate distortion test 

Consider a distribution $Q$ on a set $\Omega$ with a distortion
function $d : \Omega \times \Omega \to \mathbb{R}$. For a distortion level $d_0$ the optimal
coupling at distortion level $d_0$ is given by a Markov kernel
$\Psi_{d_0} : \Omega \to M_+^1(\Omega)$. We shall use $\Psi_{d_0}$ to smooth the
empirical distribution so that we can compare it with the null
hypothesis $H_0$, i.e. we shall use
$D(\Psi_{d_0}(\mathrm{Emp}_n(\omega)) \| \Psi_{d_0}(Q))$ as
statistic for testing goodness of fit. There are various ways to
approximate $D(\Psi_{d_0}(\mathrm{Emp}_n(\omega)) \| \Psi_{d_0}(Q))$ numerically. We
shall not discuss this problem. In general the rate distortion
function and $\Psi_{d_0}$ cannot be calculated exactly, but they can
be approximated using iterative methods like the Arimoto-Blahut
algorithm. We shall discuss three examples where the
rate distortion function and $\Psi_{d_0}$ are given by explicit formulas.
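For a finite alphabet, the iterative approximation mentioned above can be sketched as below. This is the standard alternating update for a fixed slope parameter (called `beta` here), written as a plain illustration; the function name, the iteration count and the test case with Hamming distortion are assumptions, not the author's implementation.

```python
import math

def blahut_arimoto_kernel(p, d, beta, iterations=200):
    """Approximate the optimal Markov kernel at slope parameter beta.

    p    : source distribution (list of probabilities)
    d    : distortion matrix, d[x][y]
    Returns kernel[x][y], the conditional probability of y given x.
    """
    m = len(d[0])
    q = [1.0 / m] * m                      # initial output distribution
    kernel = []
    for _ in range(iterations):
        kernel = []
        for x in range(len(p)):
            weights = [q[y] * math.exp(-beta * d[x][y]) for y in range(m)]
            total = sum(weights)
            kernel.append([w / total for w in weights])
        # Update the output marginal induced by the current kernel.
        q = [sum(p[x] * kernel[x][y] for x in range(len(p))) for y in range(m)]
    return kernel

l, beta = 5, 1.5
uniform = [1.0 / l] * l
hamming = [[0.0 if x == y else 1.0 for y in range(l)] for x in range(l)]
kernel = blahut_arimoto_kernel(uniform, hamming, beta)
# For Hamming distortion the kernel should have the form a*delta_x + (1-a)*U.
a = kernel[0][0] - kernel[0][1]
```

For a uniform source with Hamming distortion the output marginal stays uniform, so the computed kernel can be compared with the closed form $a = (1 - e^{-\beta}) / (1 + (l - 1) e^{-\beta})$.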

Example 1 (Test of uniformity): We consider a set $A$ with
$l$ elements. The set has no particular structure so we use
Hamming distortion as distortion function. Our null hypothesis
is $P = U$ where $U$ denotes the uniform distribution on $A$. In
this case the Markov kernel $\Psi_{d_0}$ has the form

$$\Psi_{d_0} : x \mapsto a\,\delta_x + (1 - a)\,U$$

for some value $a \in [0; 1]$ determined by $d_0$. The Markov kernel
maps the uniform distribution into the uniform distribution.
Therefore the statistic of the rate distortion test has the form

$$D\left(a\,\mathrm{Emp}_n(\omega) + (1 - a)\,U \,\middle\|\, U\right).$$

This statistic is closely related to the idea of local alternatives
often studied in statistics.

Example 2 (Normality test): We consider the real numbers
with squared Euclidean distance as distortion function. Our null
hypothesis is $P = \Phi$ where $\Phi$ denotes the standard Gaussian
distribution. The optimal Markov kernel for the rate distortion
problem sends $x$ into the distribution of $ax + (1 - a^2)^{1/2} Z$
where $Z$ is a standard Gaussian random variable. We see
that the Gaussian distribution is mapped into itself. Thus the
statistic of the rate distortion test is

$$D\left(aX + (1 - a^2)^{1/2} Z \,\middle\|\, \Phi\right)$$

where $X$ has distribution $\mathrm{Emp}_n(\omega)$ and is independent of $Z$,
and where we have identified the random variable

$$aX + (1 - a^2)^{1/2} Z$$

with its distribution. This statistic can be rewritten as

$$D\left(aX + (1 - a^2)^{1/2} Z \,\middle\|\, \Phi\right)
= D\left(X + (a^{-2} - 1)^{1/2} Z \,\middle\|\, \Phi(0, a^{-2})\right)$$

so the Markov kernels essentially smooth data by adding an
independent Gaussian random variable with variance $a^{-2} - 1$.
The idea of smoothing data is well-known in statistics.
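The action of the Gaussian kernel can be checked by simulation. The sketch below is illustrative only: the value of `a`, the sample size and the seed are arbitrary choices, and `smooth` is a hypothetical helper name.

```python
import math
import random

def smooth(x, a, rng):
    """Draw from the kernel image of x: a*x + sqrt(1 - a^2) * Z."""
    return a * x + math.sqrt(1.0 - a * a) * rng.gauss(0.0, 1.0)

rng = random.Random(1)                      # fixed seed, illustrative only
a = 0.6
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # sample from Phi
smoothed = [smooth(x, a, rng) for x in data]

mean = sum(smoothed) / len(smoothed)
variance = sum((s - mean) ** 2 for s in smoothed) / len(smoothed)
# Rescaling by 1/a gives X + sqrt(a^-2 - 1) * Z, with added variance a^-2 - 1.
added_variance = (1.0 - a * a) / (a * a)
```

The smoothed sample should again have mean close to 0 and variance close to 1, reflecting that the standard Gaussian is mapped into itself.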

Example 3 (Test of uniformity of angular data): In this
example we consider data with values on the circle $S^1$, which
we can identify with $\mathbb{R}/2\pi\mathbb{Z}$. See [6] for references. As
distortion function we shall use
$4\sin^2\left(\frac{y - x}{2}\right) = 2 - 2\cos(y - x)$, i.e. squared
Euclidean distance between points on the unit circle. We shall test the
hypothesis $P = U$ where $U$ denotes the uniform distribution
on the circle. The optimal Markov kernel is a smoothing by
adding a von Mises distribution with density

$$\frac{\exp(\kappa \cos(\theta))}{2\pi I_0(\kappa)}$$

where $I_0$ is the modified Bessel function of order 0 and the
parameter $\kappa$ is determined by the distortion level [7], [8]. The
Markov kernel maps the uniform distribution into the uniform
distribution.

IV. Limits for extreme values of $\beta$

Often the rate distortion curve is parametrized by its slope
$\beta$. Here we shall discuss the effect of choosing very small or
very large values of $\beta$ when the sample is kept fixed. We shall
go through our three main examples from this point of view.



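Example 4 (Test of uniformity continued): Small or large
values of $\beta$ correspond to small or large values of $a$. For
$a = 1$ we get the statistic

$$D(\mathrm{Emp}_n(\omega) \| U)$$

which is the likelihood ratio statistic. For $a$ close to 0 we use that
information divergence is an $f$-divergence with $f(x) = x \ln x$.
Writing $p(i)$ for the relative frequency of $i$ in the sample, we get

$$\frac{d}{da} D\left(a\,\mathrm{Emp}_n(\omega) + (1 - a)\,U \,\middle\|\, U\right)
= \frac{d}{da} \sum_{i} \frac{1}{l}\, f\left(\frac{a\,p(i) + (1 - a)\frac{1}{l}}{\frac{1}{l}}\right)
= \sum_{i} \left(p(i) - \frac{1}{l}\right) f'\left(\frac{a\,p(i) + (1 - a)\frac{1}{l}}{\frac{1}{l}}\right)$$

and

$$\frac{d^2}{da^2} D\left(a\,\mathrm{Emp}_n(\omega) + (1 - a)\,U \,\middle\|\, U\right)
= l \sum_{i} \left(p(i) - \frac{1}{l}\right)^2 f''\left(\frac{a\,p(i) + (1 - a)\frac{1}{l}}{\frac{1}{l}}\right).$$

At $a = 0$ the divergence and its first derivative vanish, and $f''(1) = 1$.
Thus a second order Taylor expansion gives

$$D\left(a\,\mathrm{Emp}_n(\omega) + (1 - a)\,U \,\middle\|\, U\right)
\approx \frac{a^2}{2}\, l \sum_{i} \left(p(i) - \frac{1}{l}\right)^2
= \frac{a^2}{2}\, \chi^2(\mathrm{Emp}_n(\omega), U)$$

where $\chi^2(P, U) = \sum_i \left(p(i) - \frac{1}{l}\right)^2 / \frac{1}{l}$.
Thus using a small value of $a$ approximately corresponds to
replacing the likelihood ratio test with a $\chi^2$ test.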
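The two limits of Example 4 can be checked numerically. The sketch below assumes the $\chi^2$-divergence is normalized as $\chi^2(P, U) = l \sum_i (p(i) - 1/l)^2$, so that the small-$a$ approximation reads $\frac{a^2}{2}\chi^2$; the relative frequencies are hypothetical.

```python
import math

def divergence(p, q):
    """Information divergence D(p||q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rd_uniformity_statistic(emp, a):
    """D(a*Emp + (1-a)*U || U) for an empirical distribution `emp`."""
    l = len(emp)
    u = 1.0 / l
    mixed = [a * p + (1.0 - a) * u for p in emp]
    return divergence(mixed, [u] * l)

def chi_square(emp):
    """Pearson chi-square divergence from emp to the uniform distribution."""
    l = len(emp)
    return sum((p - 1.0 / l) ** 2 for p in emp) * l

emp = [0.4, 0.3, 0.2, 0.1, 0.0]            # hypothetical relative frequencies
a = 0.01
exact = rd_uniformity_statistic(emp, a)
approx = 0.5 * a * a * chi_square(emp)
likelihood_ratio = rd_uniformity_statistic(emp, 1.0)   # a = 1: no smoothing
```

For small `a` the values `exact` and `approx` should nearly coincide, while `a = 1` reproduces the divergence of the unsmoothed empirical distribution.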

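Example 5 (Normality test continued): Small or large values
of $\beta$ correspond to small or large values of $a$. If the $i$'th
observation is denoted $x_i$ then

$$
\begin{aligned}
D\left(\Psi_{d_0}(\mathrm{Emp}_n(\omega)) \,\middle\|\, \Psi_{d_0}(\Phi)\right)
&= D\left(\frac{1}{n}\sum_{i=1}^{n} \Psi_{d_0}(\delta_{x_i}) \,\middle\|\, \Phi\right) \\
&= \frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{x_i}) \,\middle\|\, \Phi\right)
 - \frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{x_i}) \,\middle\|\, \frac{1}{n}\sum_{j=1}^{n} \Psi_{d_0}(\delta_{x_j})\right) \\
&= \frac{a^2}{2}\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2 - 1\right) - \frac{1}{2}\ln(1 - a^2)
 - \frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{x_i}) \,\middle\|\, \frac{1}{n}\sum_{j=1}^{n} \Psi_{d_0}(\delta_{x_j})\right).
\end{aligned}
$$

For large values of $a$ (close to 1) we only smooth a little so the different
smoothed observations are approximately singular, and the last sum
is approximately $\ln n$. Thus

$$D\left(\Psi_{d_0}(\mathrm{Emp}_n(\omega)) \,\middle\|\, \Psi_{d_0}(\Phi)\right)
\approx \frac{a^2}{2}\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2 - 1\right) - \frac{1}{2}\ln(1 - a^2) - \ln n.$$

In this case the use of the rate distortion statistic is approximately
equivalent to the use of the statistic $\frac{1}{n}\sum_{i=1}^{n} x_i^2$. This
statistic is sufficient for alternatives in the exponential family
$\Phi(0, \sigma^2)$, $\sigma > 0$.

For small values of $a$ we use a different expansion. We
use that $D\left(aX + (1 - a^2)^{1/2} Z \,\middle\|\, \Phi\right)$ has a leading term
determined by the mean value of $X$. Therefore the statistic
essentially reduces to $\frac{1}{n}\sum_{i=1}^{n} x_i$. This statistic is sufficient
for alternatives in the exponential family $\Phi(\mu, 1)$.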
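The large-$a$ regime of Example 5 can be checked by numerical integration. The sketch below assumes the approximation $\frac{a^2}{2}\left(\frac{1}{n}\sum_i x_i^2 - 1\right) - \frac{1}{2}\ln(1 - a^2) - \ln n$ for well separated observations; the observations, the grid and the function names are hypothetical choices.

```python
import math

def smoothed_density(t, xs, a):
    """Density at t of the mixture (1/n) sum_i N(a*x_i, 1 - a^2)."""
    var = 1.0 - a * a
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    return sum(norm * math.exp(-(t - a * x) ** 2 / (2.0 * var)) for x in xs) / len(xs)

def gaussian_rd_statistic(xs, a, lo=-8.0, hi=8.0, steps=16000):
    """Riemann-sum approximation of D(smoothed empirical || standard Gaussian)."""
    h = (hi - lo) / steps
    total = 0.0
    for m in range(steps):
        t = lo + (m + 0.5) * h
        p = smoothed_density(t, xs, a)
        if p > 0.0:
            log_phi = -0.5 * t * t - 0.5 * math.log(2.0 * math.pi)
            total += p * (math.log(p) - log_phi) * h
    return total

xs = [-1.5, 0.0, 2.0]                      # hypothetical observations
a = 0.995
n = len(xs)
mean_square = sum(x * x for x in xs) / n
exact = gaussian_rd_statistic(xs, a)
approx = 0.5 * a * a * (mean_square - 1.0) - 0.5 * math.log(1.0 - a * a) - math.log(n)
```

With a single observation the mixture term vanishes and the statistic reduces to the divergence between two Gaussians, which gives a second consistency check on the integration.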

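Example 6 (Uniformity of angular data continued): Small
or large values of $\beta$ correspond to small or large values of
$\kappa$. For small values of $\kappa$ we have

$$\frac{\exp(\kappa\cos(\theta))}{2\pi I_0(\kappa)} \approx \frac{1 + \kappa\cos(\theta)}{2\pi}.$$

For observations $\theta_1, \theta_2, \ldots, \theta_n$ the smoothed distribution
approximately has density

$$\frac{1}{2\pi}\left(1 + \frac{\kappa}{n}\sum_{i=1}^{n}\cos(\theta - \theta_i)\right)
= \frac{1}{2\pi}\left(1 + \kappa\left\langle \begin{pmatrix}\cos\theta \\ \sin\theta\end{pmatrix},
\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix}\cos\theta_i \\ \sin\theta_i\end{pmatrix}\right\rangle\right).$$

The rate distortion statistic will approximately be determined by the vector
$\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix}\cos\theta_i \\ \sin\theta_i\end{pmatrix}$.
By rotational symmetry the information
divergence does not depend on the direction of this vector,
only on its length. Thus the use of the rate distortion
statistic is essentially equivalent to the use of the statistic

$$\left\|\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix}\cos\theta_i \\ \sin\theta_i\end{pmatrix}\right\|.$$

This is the most used statistic for
testing uniformity of angular data.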
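We have

$$
\begin{aligned}
D\left(\Psi_{d_0}(\mathrm{Emp}_n(\omega)) \,\middle\|\, U\right)
&= \frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{\theta_i}) \,\middle\|\, U\right)
 - \frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{\theta_i}) \,\middle\|\, \frac{1}{n}\sum_{j=1}^{n}\Psi_{d_0}(\delta_{\theta_j})\right) \\
&= D\left(\Psi_{d_0}(\delta_0) \,\middle\|\, U\right)
 - \frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{\theta_i}) \,\middle\|\, \frac{1}{n}\sum_{j=1}^{n}\Psi_{d_0}(\delta_{\theta_j})\right).
\end{aligned}
$$

For large values of $\kappa$ the term

$$\frac{1}{n}\sum_{i=1}^{n} D\left(\Psi_{d_0}(\delta_{\theta_i}) \,\middle\|\, \frac{1}{n}\sum_{j=1}^{n}\Psi_{d_0}(\delta_{\theta_j})\right)$$

will be dominated by the pair $(\theta_i, \theta_j)$, $i \neq j$, for which
$\cos(\theta_i - \theta_j)$ is maximal.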
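The small-$\kappa$ behaviour in Example 6 can be illustrated numerically. Expanding the divergence to second order in $\kappa$ suggests the approximation $\kappa^2 r^2 / 4$, where $r$ is the length of the mean vector; this constant is derived here for the sketch and is not quoted from the paper. The angles, the grid size and the function names are hypothetical.

```python
import math

def bessel_i0(kappa, terms=40):
    """Modified Bessel function I_0 via its power series."""
    return sum((kappa / 2.0) ** (2 * m) / math.factorial(m) ** 2 for m in range(terms))

def angular_rd_statistic(angles, kappa, steps=2000):
    """Riemann-sum approximation of D(smoothed empirical || uniform) on the circle."""
    i0 = bessel_i0(kappa)
    h = 2.0 * math.pi / steps
    total = 0.0
    for m in range(steps):
        theta = (m + 0.5) * h
        # Ratio of the smoothed density to the uniform density 1/(2*pi).
        ratio = sum(math.exp(kappa * math.cos(theta - t)) for t in angles) / (len(angles) * i0)
        total += ratio * math.log(ratio) * h / (2.0 * math.pi)
    return total

def resultant_length(angles):
    """Length of the mean vector (1/n) sum_i (cos, sin)."""
    c = sum(math.cos(t) for t in angles) / len(angles)
    s = sum(math.sin(t) for t in angles) / len(angles)
    return math.hypot(c, s)

angles = [0.1, 0.4, 0.9, 1.3, 2.0, 5.9]    # hypothetical observations (radians)
kappa = 0.05
exact = angular_rd_statistic(angles, kappa)
r = resultant_length(angles)
approx = (kappa * r) ** 2 / 4.0
```

Rotating all angles by the same amount leaves the statistic unchanged up to numerical error, in line with the rotational symmetry argument.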

V. Hodges-Lehmann efficiency

For testing uniformity with Hamming distortion we see that
if we do not compress data ($a = 1$) the rate distortion test gives
the statistic $D(\mathrm{Emp}_n(\omega) \| U)$, which is known to be Bahadur
efficient for testing uniformity. This is in general not the case.
For a rate distortion test of normality little compression gives
a statistic that is efficient for Gaussian alternatives with mean
zero and variance different from 1, but it is obviously not
efficient against other alternatives with mean 0 and variance 1.
Similarly, with little compression the rate distortion test of
uniformity of angular data
depends on the maximal value of $\cos(\theta_i - \theta_j)$, $i \neq j$, but not
on the values of all other observed angles, which is obviously not
efficient. So the question is how much one should compress
in order to get an efficient test against any alternative.

There are several ways of measuring efficiency, among
which the following are the most important. In this short note it
is not possible to give all definitions and proofs in detail.

Hodges-Lehmann efficiency: An alternative hypothesis
and a significance level are fixed. One is interested in the
sample size that is needed to achieve a certain large power of
the test.

Bahadur efficiency: An alternative hypothesis and a power
level are fixed. One is interested in the sample size that is
needed to achieve a certain small significance level of the test.

Pitman efficiency: The alternative is moved closer to the null
hypothesis when the sample size is increased. This is done in a way so that the
power of the test is constant. One is interested in the sample
size that is needed to achieve a certain fixed significance level
of the test.

The Hodges-Lehmann efficiency is often the easiest to
calculate but most tests are equally efficient in this sense.
More tests can be distinguished by their Pitman efficiency.
The Bahadur efficiency is often the most sensitive and at the
same time often the hardest to calculate.

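Theorem 7: Assume that the space is compact and that
the distortion function is continuous. Let $d_n$ denote a decreasing
sequence of distortion values. Assume that $Q$ generates
data. Then for any $\epsilon > 0$

$$Q\left(D\left(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \,\middle\|\, \Psi_{d_n}(Q)\right) \geq \epsilon\right) \to 0 \quad \text{for } n \to \infty$$

if $d_n$ tends to 0 sufficiently slowly.

Proof: It is sufficient to show that

$$Q\left(D\left(\Psi_{\delta}(\mathrm{Emp}_n(\omega)) \,\middle\|\, \Psi_{\delta}(Q)\right) \geq \epsilon\right) \to 0 \quad \text{for } n \to \infty$$

for any fixed distortion level $\delta > 0$. On a compact space, weak
convergence of $\mathrm{Emp}_n(\omega)$ to $Q$ means convergence in the Wasserstein sense.
Continuity of the distortion function $d$ implies that $\Psi_{\delta}$ is weakly
continuous on the set of probability measures. ■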



Theorem 8: Let $d_n$ denote a decreasing sequence of distortion
values. Assume that $Q$ generates data. Then

$$\liminf_{n \to \infty} D\left(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \,\middle\|\, \Psi_{d_n}(P)\right) \geq D(Q \| P)$$

almost surely.

Proof: This follows by lower semi-continuity of information
divergence because $\Psi_{d_n}(\mathrm{Emp}_n(\omega))$ tends to $Q$ and $\Psi_{d_n}(P)$ tends
to $P$ in the weak topology. ■

If $P$ denotes an alternative to a null hypothesis $Q$ then
according to Sanov's theorem for a fixed significance level the
best achievable type 2 error decreases like $\exp(-n D(Q \| P))$.
The two previous theorems together imply that the rate
distortion test on a compact set with a continuous distortion
function achieves the same exponential decrease in type 2
error. Hence, the rate distortion test is efficient in the sense of
Hodges and Lehmann.

VI. Bahadur efficiency 

We shall analyze this question in the case of testing uniformity
of angular data, which is particularly simple
because angles can be identified with elements of $SO(2)$.

Theorem 9: Let $d_n$ denote a decreasing sequence of distortion
values. Assume that $P$ generates data. Then

$$\liminf_{n \to \infty} D\left(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \,\middle\|\, U\right) \geq D(P \| U)$$

almost surely.

Proof: The proof is essentially the same as the proof of
Theorem 8. ■
The theorem implies that for any $K < D(P \| U)$ we have
$D(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \| U) \geq K$ eventually almost surely. So if
$P$ is the distribution of the alternative hypothesis and the
power of the test is kept fixed, then the acceptance regions of the
alternative $P$ in the rate distortion test must have the form
$D(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \| U) \geq K_n$ for $K_n \to D(P \| U)$. In
order to determine the Bahadur efficiency we have to bound
the probability of $D(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \| U) \geq K_n$ under the
null hypothesis that data are generated by a uniform distribution.
Now partition the set of angles $[0; 2\pi[$ into $k_n$ intervals
of length $2\pi/k_n$. We choose $k_n$ tending to infinity slowly enough,
as a function of $n$, for the likelihood ratio test on the resulting
partition to be Bahadur efficient (cf. Section II). Let $\mathcal{F}_n$
denote the $\sigma$-algebra generated by these
intervals. Then

$$\lim_{n \to \infty} -\frac{1}{n}\log\Pr\left(D\left(\mathrm{Emp}_n(\omega)_{|\mathcal{F}_n} \,\middle\|\, U_{|\mathcal{F}_n}\right) \geq K_n\right) = D(P \| U).$$
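We are interested in

$$D\left(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \,\middle\|\, U\right) = D\left(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \,\middle\|\, \Psi_{d_n}(U)\right)$$

and not $D\left(\mathrm{Emp}_n(\omega)_{|\mathcal{F}_n} \,\middle\|\, U_{|\mathcal{F}_n}\right)$, but each subinterval has
length $2\pi/k_n$ so, with $\kappa_n$ denoting the von Mises parameter
corresponding to $d_n$,

$$\left|\log\frac{d\Psi_{d_n}(\mathrm{Emp}_n(\omega))}{d\Psi_{d_n}(U)}
 - \log\frac{d\,\mathrm{Emp}_n(\omega)_{|\mathcal{F}_n}}{dU_{|\mathcal{F}_n}}\right|
\leq \left|\log\frac{\exp(\kappa_n\cos(0)) / (2\pi I_0(\kappa_n))}{\exp(\kappa_n\cos(2\pi/k_n)) / (2\pi I_0(\kappa_n))}\right|
= \kappa_n\left|\cos(0) - \cos(2\pi/k_n)\right|.$$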



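Therefore

$$\lim_{n \to \infty} -\frac{1}{n}\log\Pr\left(D\left(\Psi_{d_n}(\mathrm{Emp}_n(\omega)) \,\middle\|\, U\right) \geq K_n\right) = D(P \| U),$$

i.e. the test is Bahadur efficient, if

$$\frac{\kappa_n}{n}\left|\cos(0) - \cos(2\pi/k_n)\right| \to 0$$

for $n \to \infty$. An expansion of cosine around 0 shows that the
condition is equivalent to

$$\frac{\kappa_n}{n k_n^2} \to 0$$

for $n \to \infty$.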



If we choose $k_n = n^{\gamma}$ where $\gamma < 1$ we get the sufficient
condition

$$\frac{\kappa_n}{n^{1 + 2\gamma}} \to 0 \quad \text{for } n \to \infty.$$

This leads us to the following theorem. 

Theorem 10: The rate distortion test of uniformity of angular
data has smoothing by a von Mises distribution with
parameter $\kappa_n$. If $\kappa_n \to \infty$ for $n \to \infty$ and there exists
$\eta \in [1; 3[$ such that

$$\frac{\kappa_n}{n^{\eta}} \to 0 \quad \text{for } n \to \infty,$$

then the rate distortion test is Bahadur efficient.
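The cosine expansion behind the growth condition can be checked with a few numbers. The sketch below reads the condition as $\frac{\kappa_n}{n}\left|\cos(0) - \cos(2\pi/k_n)\right| \to 0$ with leading term $2\pi^2 \kappa_n / (n k_n^2)$; this reading, the choice $\gamma = 1/2$ and the growth $\kappa_n = n^{1.5}$ are assumptions made for illustration.

```python
import math

def exact_condition(n, kappa_n, k_n):
    """(kappa_n / n) * |cos(0) - cos(2*pi/k_n)|."""
    return kappa_n / n * abs(math.cos(0.0) - math.cos(2.0 * math.pi / k_n))

def expanded_condition(n, kappa_n, k_n):
    """Leading term after expanding the cosine: 2*pi^2 * kappa_n / (n * k_n^2)."""
    return 2.0 * math.pi ** 2 * kappa_n / (n * k_n ** 2)

gamma = 0.5                                 # hypothetical choice, k_n = n^gamma
rows = []
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    k_n = n ** gamma
    kappa_n = n ** 1.5                      # hypothetical growth, below n^(1+2*gamma) = n^2
    rows.append((n, exact_condition(n, kappa_n, k_n), expanded_condition(n, kappa_n, k_n)))
```

Each entry of `rows` holds $n$, the exact expression and its expansion; both columns shrink as $n$ grows because $\kappa_n$ grows more slowly than $n^{1 + 2\gamma}$.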

The method sketched here can be extended to prove Bahadur
efficiency of the rate distortion test of uniformity on compact
groups [7], [8].

VII. Discussion 

A new statistical test is proposed. It is based on a rate
distortion function. By specifying the distortion function one
does not have to divide the data into bins, as this is built
into the test. We have discussed the test in detail for a few
examples. The example with testing uniformity of angular data
can be extended to compact groups. There is no standard
procedure for testing uniformity on a group, but there are
many competing tests for the Gaussian distribution. In [9],
[10] and [11] it has been shown by simulations that tests
based on entropy estimation are more powerful than many
other tests for normality that one can find in the literature. The
author has done some simulations to compare these tests with
the test proposed here. These simulations indicate that the
rate distortion test has good power, but these results are still
preliminary and will not be presented in this short note.

We saw that the rate distortion test has good Bahadur 
efficiency for angular data. We conjecture that the proposed 
test has high Bahadur efficiency in any case where it can 
be applied. It is not clear how to formulate this conjecture 
precisely, and it may be hard to prove because the rate
distortion function normally cannot be calculated exactly.

A nice feature of the rate distortion test is that one can
get a clear understanding of the effect of very small or very
large compression. In our examples very small or very large
compression in the rate distortion test corresponds to other
familiar tests like $\chi^2$-testing, and this may actually be used
to give new interpretations of these tests. This is in contrast
with the common approach via discretizations. It is simply
difficult to analyze the effect of discretizing data into very few
bins because 2 gives an absolute lower bound on how many
bins one can use if the analysis should not become trivial.

Another conjecture that has been supported by numerical
calculations is that the rate distortion statistic is asymptotically
Gaussian. As it is now we have to use Monte Carlo
simulation of the rate distortion statistic, and each simulation
involves a numerical calculation of the rate distortion function.
If it can be proved that the distribution of the rate distortion
statistic is asymptotically Gaussian, the number
of simulations can be reduced significantly because one just
has to estimate mean and variance in order to be able to
calculate the critical value for a specified significance level.
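A Monte Carlo determination of the critical value can be sketched for the uniformity test of Example 1, where the kernel is explicit and no numerical rate distortion calculation is needed. The sample size, alphabet size, value of `a`, number of runs and seed are hypothetical, and the function names are not from the paper.

```python
import math
import random

def rd_uniformity_statistic(counts, a):
    """D(a*Emp + (1-a)*U || U) computed from a list of cell counts."""
    n, l = sum(counts), len(counts)
    total = 0.0
    for c in counts:
        ratio = a * l * c / n + (1.0 - a)   # density of the mixture w.r.t. U
        if ratio > 0.0:
            total += ratio * math.log(ratio) / l
    return total

def monte_carlo_critical_value(n, l, a, level, runs, rng):
    """Empirical (1 - level) quantile of the statistic under the uniform null."""
    values = []
    for _ in range(runs):
        counts = [0] * l
        for _ in range(n):
            counts[rng.randrange(l)] += 1
        values.append(rd_uniformity_statistic(counts, a))
    values.sort()
    index = min(runs - 1, int(math.ceil((1.0 - level) * runs)) - 1)
    return values[index], values

rng = random.Random(2)                      # fixed seed, illustrative only
critical, simulated = monte_carlo_critical_value(
    n=50, l=5, a=0.5, level=0.05, runs=2000, rng=rng)
# Summary statistics that would suffice if the statistic were Gaussian.
mean = sum(simulated) / len(simulated)
std = math.sqrt(sum((v - mean) ** 2 for v in simulated) / len(simulated))
```

If asymptotic normality held, `mean` and `std` estimated from far fewer runs would be enough to approximate `critical`; the empirical quantile used here makes no such assumption.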

In this paper some simple examples where the rate distortion
function can be calculated exactly have been discussed.
There are other examples where the rate distortion
function can be calculated exactly. One interesting example is
the Poisson process discussed in [12]. The setup is slightly
different from the one presented here and therefore we cannot
discuss it in this short paper. Nevertheless the ideas presented
in this paper can be used to construct a test of whether a
random process is a Poisson process. Contrary to the examples
discussed in this paper this test of the Poisson process is
completely new in the sense that it does not relate to any
established statistical test.